-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCS_MAVLink: respect txbuf flow control for FTP messages #26098
Conversation
I wonder if it's a good idea to put all mftp replies under txbuf control, instead of only those of the read/readburst stream. Putting also the mftp replies for "negotiating" under control may make it harder for gcses to get their timeouts right. |
@olliw42 Did you happen to test this? (I ask just so we can know what you saw if you did.) Either way, I think I understand why you might wonder. It does sometimes (less than 10% in my recent tests) happen that we wait so long for buffer space that the GCS times out and ends up re-sending the first FTP message (OpenFileRO in the case of QGC). I have some thoughts about this.
Instead, I think it might try limiting the delay due to txbuf to about 500ms for all message types. If we don't have more than .5 seconds worth of data buffered, that might help. This would, however never work with a radio bandwidth of less than about 300 bytes/second. Note that mLRS (which currently has at least 1500 Bytes/sec). Of course, the current GCS timeouts would also need to change for lower bandwidth and higher latency. |
I've made several improvements in the latest commit and I think I've addressed all the comments. This is working even better in my tests. |
Tested with:
Params load substantially faster than with individual param polling (as expected) on both rates. Flashed the FC back to mainline and confirmed that the original issues still exist: FTP param load fails (due to overloading the buffer in ELRS) and it falls back to polled param mode. Whats left to get this merged? |
Looking at this slightly differently - I wonder why the GCS needs to get all the parameters up front in order for the vehicle to fly. I'm thinking this problem might have evolved organically. Perhaps getting all parameters up front was a much smaller ask a few years ago, but as more and more parameters are added, we just tack them into the package of what gets downloaded. I notice this most because I typically use MP on the bench and QGC when I'm flying. When using QGC I operate with far fewer parameters that I care about while at the field, whereas on the bench I definitely need a lot more of them, often because MP needs them, but I"m pretty much always connected by USB when I use Mission Planner. Would it be possible to narrow this down to an initial download of a much smaller package of "essential" parameters required for flying while connected over ELRS? [I recognize this comment/conversation probably belongs somewhere else, but I don't know where that might be]. |
I've pushed up ExpressLRS/ExpressLRS#2626 on the ELRS side, for when this merges. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a rebase, its conflicting with something that has been merged recently.
@@ -155,9 +146,25 @@ void GCS_MAVLINK::ftp_error(struct pending_ftp &response, FTP_ERROR error) { | |||
// send our response back out to the system | |||
void GCS_MAVLINK::ftp_push_replies(pending_ftp &reply) | |||
{ | |||
ftp.last_send_ms = AP_HAL::millis(); // Used to detect active FTP session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why move this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we want to detect the active FTP session as soon as possible in get_reschedule_interval_ms(), even if we are blocked from sending the first response because of txbuf flow control. This slows down normal streams sooner and allows the radio's buffer to drain more quickly and reduces the delay the GCS sees, which helps avoid GCS timout on the first FTP response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that makes sense. But its there now a risk of of FTP timing out if we sit here waiting to send for a while. Looks like the threshold is 3 seconds? Maybe we need a better way to check if there is FTP stuff happening than just looking at last_send_ms
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, since messages can be lost and the GCS may give up on FTP, testing for FTP activity will need to involve a timeout and I think this may be the simplest way. Where is the timeout 3 seconds? QGC seems to use 1 second timeouts.
bool GCS_MAVLINK::last_txbuf_is_greater(uint8_t txbuf_limit) | ||
{ | ||
if (AP_HAL::millis() - last_radio_status.received_ms > 5000) { | ||
// stale report |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// stale report | |
// stale report, assume 100% buffer space |
@@ -759,6 +759,7 @@ class GCS_MAVLINK | |||
uint32_t remrssi_ms; | |||
uint8_t rssi; | |||
uint32_t received_ms; // time RADIO_STATUS received | |||
uint8_t txbuf = 100; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new timeout we don't need to init to 100 anymore.
uint8_t txbuf = 100; | |
uint8_t txbuf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add the check for 0 ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems more efficient to initialize just once than to execute the extra check every time.
@@ -827,6 +827,15 @@ float GCS_MAVLINK::telemetry_radio_rssi() | |||
return last_radio_status.rssi/254.0f; | |||
} | |||
|
|||
bool GCS_MAVLINK::last_txbuf_is_greater(uint8_t txbuf_limit) | |||
{ | |||
if (AP_HAL::millis() - last_radio_status.received_ms > 5000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (AP_HAL::millis() - last_radio_status.received_ms > 5000) { | |
if ((last_radio_status.received_ms == 0) || ((AP_HAL::millis() - last_radio_status.received_ms) > 5000)) { |
Check for zero to make sure we do the correct thing if never received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a comment pointing out that the initialization of txbuf accomplishes this be OK instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, comment is OK too. Really the key thing is to be able to understand what is going on without having to go look in the .h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs a squash to remove the merge commit.
@@ -75,7 +75,7 @@ GCS_MAVLINK::queued_param_send() | |||
} | |||
count -= async_replies_sent_count; | |||
|
|||
while (count && _queued_parameter != nullptr && get_last_txbuf() > 50) { | |||
while (count && _queued_parameter != nullptr && last_txbuf_is_greater(33)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a static constexpr
for the magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would we call it? one_third? Is that really better? Anything less obvious and I would be annoyed when reading the code to have to look up the constant. I considered eliminating the parameter and hard code 33 in last_txbuf_is_greater(), but I wanted to allow for other threshold values, even though I doubt we will need them.
But I'll make the change if you let me know what I should call it.
@brad112358 thanks! needs a rebase and squash. If you don't know how to do a git squash then ask on the code_review channel and I can do it |
A tlog of a test on real hardware would be useful. With my SIM (#26663) I'm seeing the FTP replies carry on for a long time after mission planner says the params have been fetched (like 30 seconds). It could be a mission planner issue of course, or we might need some fixes on top of this. I would guess mission planner gets impatient and requests reads multiple times resulting lots of duplicate requests. |
It might be useful to test with QGC Peter. I've noticed that it behaves differently (sometimes better) and having that different behavior might give different/additional data to help figure out what is going on. |
Thanks. I will learn how to do that and push again after recent review comments are resolved. |
@IamPete1 and @peterbarker If you could respond to my comments on your suggested changes, then I can clean this up. |
@@ -1142,7 +1151,7 @@ uint16_t GCS_MAVLINK::get_reschedule_interval_ms(const deferred_message_bucket_t | |||
// we are sending requests for waypoints, penalize streams: | |||
interval_ms *= 4; | |||
} | |||
if (AP_HAL::millis() - ftp.last_send_ms < 500) { | |||
if (AP_HAL::millis() - ftp.last_send_ms < 1000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd that we would use 1 second here but 3 seconds to time out a FTP session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only looked at QGroundControl which has a 1 second ack/nak timeout with 3 retries. Does Mission Planner use a 3 second timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of this timeout:
#define FTP_SESSION_TIMEOUT 3000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(our responses crossed here) OK. I think things should work OK with this changed to 3 seconds. The stream rates will just take 3 seconds to go back to normal instead of 1. Should we move the define to a header file and use FTP_SESSION_TIMEOUT instead of 500 or 1000 in GCS_Common.cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, after more consideration, I think there is little reason that this timeout should be the same as the session timeout. This is essentially a heuristic and should be based on the expected behavior of the GCS, including radio latency, but doesn't need to match the session timeout which really needs to be conservative (long) as discarding the open state too early would be bad. With the current GCS behavior, I think 1 second is fine.
On the other hand, it would be easier for slower radios to make this work well if the GCS FTP retry timeout was a bit longer, say 2 seconds. Bumping this from 500 to 1000 was enough for mLRS which was already working hard to keep buffered data to a minimum. But, ELRS is even slower than mLRS. Should we increase this to 2 seconds in anticipation that some work may also be required on the GCS side to help out ELRS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking is that if FTP thinks the session is still active then we should still lower its data rates. However a quick test shows the MAVProxy terminates the session after param download whereas QGC and MP do not. If we get the GCS to terminate correctly there should be no problem in using the longer time out here.
Any chance of that tlog? As I say even with this I didn't get good results (although still better than without), have you been testing with Mission planner or another GCS? |
I don't use ELRS so I've only tested with mLRS and briefly with SiK and Crossfire. I'm traveling for the next few days, but I can capture tlogs with mLRS when I get back if I don't get to it sooner. Perhaps I can also test with ELRS as I do have hardware it supports, but I expect that ELRS might require some work to get it working perfectly with FTP, even with this PR. |
Ah, OK, if you can give me a rough data rate for mLRS I can set that in the simulator and see how it gets on. I will also try out QGC. Sorry for being slow, i'm keen to get #26532 so we can see the tx and rx data rates which should make testing easyer. |
The nominal down link data rates for mLRS are 1558, 2542, 4100, and 9102 bytes/sec. Up link is slightly slower to allow separate RC data. Data is sent in 82 byte chunks at 19, 31, 50, or 111 Hz. There is also a mode which can compress data, but it doesn't make a big improvement and I haven't been using it. My tests are all performed at the 19 Hz, 1558 bytes/sec rate, because I only have Frsky R9 hardware for mLRS and it is limited to this rate. Other mLRS users have tested this PR for me with the other rates. At the 1558 B/s rate, this PR improved parameter download time from about 50 seconds to about 6 seconds because FTP was able to be used with no lost messages and no retrys. The higher rates saw less dramatic improvements. Note, however, that in mLRS, Radio Status messages are not always sent at a fixed rate and the values for txbuf do not represent actual percentage, but are fudged for Ardupilot or PX4 with the goal of using between 75 and 85 percent of the available data bandwidth. This results in very low average buffer use in steady state conditions even though we allocate a fairly large, 2K byte buffer to allow clumps of messages. Usually the buffer is empty by the time the next group of messages arrives. This is quite different than a simple direct calculation of txbuf which seeks to keep the buffer utilization at around 50% full and results in higher latency than what is accomplished in mLRS, especially for very slow links. I thought I had heard that ELRS was using the same strategy as mLRS, but I see that your emulation is simply sending txbuf based on buffer used at a higher fixed rate so I'm not sure if ELRS discarded the ideas from mLRS at some point or if they have always used a fixed rate with actual buffer percentages. It is possible that the trouble you are seeing in your ELRS emulation is due to high latency because too much data is being buffered. If so, increasing the timeouts in the GCS can help, but ELRS may also need to make changes to reduce buffered data related latency. You can find the method mLRS uses here. |
Rebased, fixed the conflict, removed the merge commit and squashed the comment typo fixup commit. |
To restate my goal, this PR is just providing radios with a way to make flow control work for FTP. I think anything more is impractical given the limits of the information contained in the existing RADIO_STATUS message. I never expected that this change would accomplish optimal FTP performance with most existing radio strategies for sending txbuf updates. It just provides an opportunity (I used the phrase fighting chance for a reason) for even slow radios to make it work efficiently with some effort on the part of the radio developers. There are a few possible approaches I have experimented with on the radio side which have been successful. In the best case where the txbuf strategy on the radio side has already been designed to work well with existing Ardupilot code for non-FTP based bulk parameter download, FTP probably also works much better with this PR because it makes FTP respect txbuf the same way bulk parameter download currently does. But, mLRS is the only radio I know of that has been completely successful in developing an effective strategy on the radio side. Crossfire had some test releases that worked well for traditional bulk parameter download a couple of years ago, but they seemed to have dropped the effort when I stopped pestering them about it. Radios like SiK which send radio status only at a fixed 1 Hz rate have no hope of efficiently supporting lumpy traffic like FTP even with this PR. Until more radios can be improved to take advantage of the opportunity this PR is attempting to provide, we are stuck trying to match the serial baud rate with the radio link speed and hope the 1/3 utilization limit is effective. But mLRS does work really well with this change and I think ELRS MAVLink mode can also easily be made to work quite well with this PR since it is still in development. For the rest of the existing radios, my short term goal is only not to make things worse. In the long term, they can be made more efficient as well if anyone cares enough to try. What is the simulated link baud rate of your test? When the baud rate is low enough compared to the radio throughput, FTP will probably work, though, not optimally because of the artificial 1/3 link speed rate limiting which Ardupilot now does even without txbuf flow control. But, I normally use 230K baud for MAVLink with mLRS to minimize transmission based message latency (mLRS accumulates and parses every message at both ends of the link before starting to send it over the radio link, so a slow baud rate impacts latency even more than it might otherwise). Last I checked, ELRS was using a similar high serial baud rate, probably for the same reasons. At these higher baud rates, FTP will fail completely without some form of flow control. Also, what is the radio buffer size in your tests? I assume you are not emulating what mLRS does, but just reporting raw txbuf percentages. How often are you sending radio status? At higher baud rates you need to have a huge buffer (bad) or send status very frequently if sending at a fixed rate or (better) send an extra radio status report whenever the buffer nears full or empty to stop or start the flow immediately. If the radio uses a good strategy for when to send radio status, you should be able achieve near full radio link utilization during FTP transfers without message loss with a reasonably small buffer. But you need smarts in the radio in addition to something like this PR on the FC side. |
I have synced my local repository with the latest changes here (and from master), configured and built for one of my 19 Hz (1558 Bps) mLRS builds. I have re-run a few quick sanity tests and uploaded tlogs covering the initial connection and parameter download for QGC (Android), MP (Windows), and MAVProxy (Linux) here. I have not analyzed these logs, but the parameter upload appeared to succeed quickly as expected in all 3 tests. Sorry it took so long for me to get to this. Please let me know what you gleam from these logs. Thanks! |
In all these years, I've never bothered with SITL, but I guess I should figure out how to set it up so I can see your radio simulator work... |
Sorry, I see I forgot to enable general access to that google drive folder. I've fixed that now. |
The QGC log looks more or less as I would have expected. Looks like the mission planner on FTP failed and it grabbed params the old way. The MAVProxy one looks OK, it does FTP, but then we have a mission planner instance appear and grab param the old way. |
What are you using to examine the tlog files?
I'll try it again. I didn't time it, but it seemed to finish faster than I thought it would with bulk parameter transfer the old way (normally takes 5-8 times as long as FTP takes)
That is a bit of a mystery. How do you know it was mission planner? I used a serial to WiFi UDP bridge in AP mode connected to the Tx module for this test and MAVProxy was running on a Linux system which was supposed to be the only client connected to the bridge WiFi network and there is no mission planner installed there. I did have my other notebook still booted into windows upstairs, but it was only connected to my home WiFi and not the bridge AP because it was downloading windows updates. I do sometimes run apmplanner2 on that Linux box, but I didn't think it was still running at the time either. SITL was surprisingly trivial to get running. I created a local branch with both your radio simulator and this code but I'm not sure how to configure and enable the radio simulation and connect a second GS to it or how to make the first MAVProxy instance use the simulated radio link. Can you give me some hints? |
https://firmware.ardupilot.org/Tools/WebTools/StreamStats/ Its still very WIP, you'll probably want to turn down the window size for looking at param stuff. The MAVProxy one looks like this: We see the FTP in purple, takes a little over 10 seconds. We don't see it on the message rate plot (I think because there is only one point so there is no line) but the total plot shows a huge spike at the end of the FTP. Its too fast to actually have come over the data link I think, The top of the page shows that there were some param requests from mission planner, but they were coming from system ID 1 which is all a bit odd, maybe mission planner was connecting via UDP to the forwarding from MAVProxy or something like that: We do see radio messages from sys ID 51. So I guess that is the ground side radio. ELRs don't send on the ground side, so that must be something that MLRS does, although looks like we don't get buffer info, just noise stats. Then finaly we see the GCS on 255, More or less what we would expect, some heartbeats, stream rate requests, commands and the FTP requests. This tool does not show any of the actual data in the message, just what there is and the rates, but UAV plotter can take tlogs if you want to look inside the messages. https://plot.ardupilot.org/#/ |
You can add the "radio" to any serial port with the extra command SITL needs a GCS to run, so you would typically connect a second GCS instance to the second port. |
This gives slow radio links a fighting chance of getting FTP bulk download working even when they use a baud rate which is much higher than their current bandwidth. This should eliminate the need to disable FTP for parameter download on slow to moderate speed radio links like mLRS and ELRS. It allows removal of a hack in mLRS which results in a decrease in parameter download time for 19 Hz mLRS from 45-60 seconds to 11-17 seconds. This should also be good news for the ELRS rc-mavlink branch.
Add check for stale radio_status to get_last_txbuf() Move last_txbuf into last_radio_status struct
Delete unneeded orphan comment replace get_last_txbuf() with a predicate Make txbuf flow control threashold consistent between Parameter download and FTP and keep it in range where we are also slowing down normal streams Delay sending text banner until after first FTP response to reduce latency on slow links Don't let flow control delay setting ftp.last_send_ms so as to slow down normal streams as soon as possible to improve FTP response time
We were having no luck getting past CI, so I have rebased to see if that will clear it. |
I ran both the MP and MAVProxy tests again and snooped on the messages with wireshark. I didn't see any sign of what you observed above so I looked at the original tlogs again with StreamStats. In both cases the parameter messages you noted are reported in a huge burst all at the exact time that the FTP transfer completed. I haven't checked the source code to be sure, but I'm guessing these are just parameter details which MP and MAVProxy both place in the tlog and StreamStats is interpreting this data as messages. I'm guessing QGC doesn't inject the parameter values in the tlog, or perhaps it uses a different format that doesn't show up in StreamStats. So, I think everything looks as expected with all 3 GCS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This gives slow radio links a fighting chance of getting FTP bulk download working efficiently, even when they use a baud rate which is much higher than their current bandwidth. The strategy is similar to what Ardupilot already does for PARAM_VALUE messages.
This should eliminate the need to disable FTP for parameter download on slow to moderate speed radio links like mLRS. It allows removal of a hack in mLRS which results in a decrease in parameter download time for 19 Hz mLRS from 45-60 seconds to 11-17 seconds. Other mLRS message rates show similar improvement. This should also be good news for the ELRS rc-mavlink branch.
Also, improve get_last_txbuf() to allow message flow in the unlikely event that we stop receiving radio_status messages.
This has also been tested with TBS Crossfire and SiK radios and found to have no negative impact on parameter download.